Information about verb subcategorization frames (SCFs) is important to
many tasks in natural language processing (NLP).  Biomedical text, a
major target for applications such as information extraction, is
composed of subdomains with different textual characteristics, and is
growing at an exponential rate.  Given the difficulty of manual domain
adaptation, it is essential to understand the issues involved in
automatic SCF acquisition for biomedicine and its subdomains.  This
paper considers three fundamental issues that arise: the performance of current
acquisition technology on biomedical text, the definition of
subcategorization in biomedical text, and the degree of SCF
variation at the level of biomedical subdomains.

First, we produce an SCF gold standard and evaluate two SCF lexicons, one taken from an existing
automatically acquired biomedical resource
% developed with NLP tools
% adapted for particular biomedical subdomains 
and the other automatically acquired for this paper from the entire PubMed Open Access collection (PMC OA) using SCF acquisition methods developed for general
language.
We find that a resource built using tools tuned for a subdomain of
biomedicine has greater precision, but a resource built using
general-language tools with a large biomedical corpus has better
coverage of SCFs that may be important for information extraction.  We
release the full unfiltered lexicon acquired from PMC OA as a
resource. 
Second, we investigate the implications of two definitions of
subcategorization, based on whether the argument-adjunct distinction
is maintained, by producing
% constructing 
and comparing two gold standards.
Evaluation of an acquired SCF lexicon against these gold standards reveals major performance
differences depending on the definition of subcategorization.
Finally, we quantify and present variation in SCF behavior
between subdomains of biomedicine.  We find significant variation,
which implies that resources targeted to a narrow subset of
biomedicine will not generalize to the entire domain. Based on the results
of these investigations, we argue for minimally supervised automatic
SCF acquisition methods as a way forward.
